MODULE-2: SUPERVISED LEARNING NETWORK


Perceptron Networks: learning rule, training and testing algorithm. Adaptive Linear Neuron:
architecture, training and testing algorithm. Back-Propagation Network: architecture, training
and testing algorithm.


LAYERS IN NEURAL NETWORK 


> The input layer
• Introduces input values into the network.
• No activation function or other processing.


> The hidden layer(s)
• Perform classification of features.
• In theory, two hidden layers are sufficient for any problem, given enough units.
• More complex features may benefit from additional layers.


> The output layer
• Functionally just like the hidden layers.
• Outputs are passed on to the world outside the neural network.


DEFINITION OF SUPERVISED LEARNING NETWORKS 
> Training and test data sets 


> Training set: inputs and targets are specified


Training process 


PERCEPTRON NETWORKS 


• Single-layer feed-forward network
• First neural network learning model, dating from the 1960s
• Simple and limited (single-layer models)
• Basic concepts are similar for multi-layer models, so this is a good learning tool
• Still used in many current applications (modems, etc.)
• Also called the simple perceptron
• The simple perceptron network is credited to Block (1962), building on Rosenblatt's earlier perceptron


✓ Key points to be noted in a perceptron network:

❖ It consists of 3 units:
   • Sensory unit (input unit)
   • Associator unit (hidden unit)
   • Response unit (output unit)

❖ Sensory units are connected to associator units with fixed weights having values 1, 0 or -1, which are assigned at random.

❖ Binary activation functions are used in the sensory unit and associator unit.

❖ The response unit has an activation of 1, 0 or -1.

❖ The binary step with fixed threshold θ is used as the activation for the associator.

❖ The output signals that are sent from the associator to the response unit are only binary.

❖ The output of the perceptron network is given by
$$y = f(y_{in})$$

❖ where $f(y_{in})$ is the activation function, defined as
$$f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$$
❖ The perceptron learning rule is used for the weight updates between the associator unit and the response unit.

❖ For each training input, the net calculates the response and determines whether or not an error has occurred.

❖ The error calculation is based on comparing the target values with the calculated outputs.

❖ The weights on the connections from units that send a non-zero signal are adjusted suitably.

❖ The weights are adjusted on the basis of the learning rule if an error has occurred for a particular training pattern:
$$w_i(\text{new}) = w_i(\text{old}) + \alpha t x_i$$
$$b(\text{new}) = b(\text{old}) + \alpha t$$

❖ If no error occurs, no weight update takes place and hence the training process may be stopped.

❖ In the above equations the target value t is +1 or -1, and α is the learning rate.


• In general, these learning rules begin with an initial guess at the weight values, and then successive adjustments are made on the basis of the evaluation of an objective function.

• Eventually, the learning rules reach a near-optimal or optimal solution in a finite number of steps.


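[Figure: original perceptron network. A sensory unit (a sensory grid representing any pattern) feeds the associator unit through fixed weights of 0, 1 or -1 assigned at random; the associator unit (output 0 or 1) feeds the response unit (output 0 or 1), whose output y1 is compared with the desired output t1.]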


✓ Sensory unit
> Can be a 2D matrix of 400 photodetectors.
> These detectors provide a binary electrical signal if the input signal is found to exceed a certain threshold value.
> These detectors are also connected randomly to the associator unit.

✓ Associator unit
> Consists of a set of sub-circuits called feature predicates.
> These are hardwired to detect a specific feature of a pattern.
> For a particular feature, each predicate is examined with a few or all of the responses of the sensory unit.
> The results from these predicate units are also binary.

✓ Response unit
> Contains the pattern recognizer, or perceptron.
> The weights present in the input layer are all fixed,
> while the weights in the response layer are trainable.


LEARNING ALGORITHM 


> Epoch: presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).

> Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it outputs 1, then Error = -1.


PERCEPTRON LEARNING RULE 


✓ In the perceptron learning rule, the learning signal is the difference between the desired and actual response of the neuron.

✓ The learning rule is explained as follows:

o Consider a finite number N of input training vectors x(n) with their associated target values t(n), where n ranges from 1 to N.

o The target is either +1 or -1.

o The output y is obtained on the basis of the net input calculated and the activation function applied over the net input:
$$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$$

o The weight update in the case of perceptron learning is as follows. If $y \ne t$, then
$$w(\text{new}) = w(\text{old}) + \alpha t x \quad (\alpha \text{ is the learning rate})$$
else
$$w(\text{new}) = w(\text{old})$$

o The weights can be initialized to any values in this method.

o The perceptron rule convergence theorem states that:

"If there is a weight vector W such that f(x(n) W) = t(n) for all n,
then for any starting vector W1, the perceptron learning rule converges
to a weight vector that gives the correct response for all training
patterns, and this learning takes place within a finite number of steps,
provided that the solution exists."
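The update rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `activation` and `perceptron_step`, and the defaults `alpha=1.0` and `theta=0.0`, are assumptions chosen for the example.

```python
def activation(y_in, theta=0.0):
    """Three-level step function with threshold theta (assumed default 0)."""
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0


def perceptron_step(w, b, x, t, alpha=1.0, theta=0.0):
    """Apply one perceptron update for the training pair (x, t).

    Returns the (possibly unchanged) weights and bias.
    """
    y_in = b + sum(xi * wi for xi, wi in zip(x, w))
    y = activation(y_in, theta)
    if y != t:
        # Error occurred: move the weights in the direction of the target.
        w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
        b = b + alpha * t
    return w, b


# Starting from zero weights, the pattern (1, 1) with target 1 is misclassified
# (the output is 0), so the weights are updated.
print(perceptron_step([0, 0], 0, [1, 1], 1))
```

When the calculated output already equals the target, the function returns the weights unchanged, which mirrors the "else" branch of the rule.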


ARCHITECTURE 


✓ The output obtained from the associator unit is a binary vector, and hence that output is taken as the input signal to the response unit.

✓ Here the weights between the sensory unit and the associator unit are fixed.

✓ The weights between the associator unit and the response unit (output unit) can be adjusted.

✓ As a result, the discussion of the network is limited to a single portion.

✓ Thus the associator unit behaves like the input unit.

[Figure: single-layer perceptron with input units x1, ..., xi, ..., xn, a bias and one output unit.]

✓ There are n input neurons, one output neuron and a bias.

✓ The input layer and output layer are connected through direct communication links, each of which is associated with a weight.

✓ The goal of the perceptron network is to classify the input pattern as a member or not a member of a particular class.


TRAINING ALGORITHM 


> Adjust neural network weights to map inputs to outputs. 


> Use a set of sample patterns where the desired output
(given the inputs presented) is known.


> The purpose is to learn to
• Recognize features which are common to good and bad


FLOW CHART OF A TRAINING PROCESS


[Flowchart: START; initialize weights and bias; set α (0 to 1); for each training pair s:t, activate the input units (xi = si), calculate the net input y_in and apply the activation y = f(y_in); if y != t, update wi(new) = wi(old) + α t xi and b(new) = b(old) + α t, otherwise keep wi(new) = wi(old) and b(new) = b(old); if any weight changed, repeat for another epoch.]


PERCEPTRON TRAINING ALGORITHM FOR 
SINGLE OUTPUT CLASS 


Step 0: Initialize the weights and bias (to zero). Also initialize the learning rate α (0 < α <= 1). For simplicity, α is set to 1.

Step 1: Perform Steps 2-6 while the stopping condition is false.
Step 2: Perform Steps 3-5 for each training pair, indicated by s:t.
Step 3: The input layer, containing the input units, is applied with identity activation functions:
$$x_i = s_i$$
Step 4: Calculate the output of the network. To do so, first obtain the net input
$$y_{in} = b + \sum_{i=1}^{n} x_i w_i$$
where n is the number of input neurons. Then apply the activation function over the net input to get the output:
$$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$$

Step 5: Weight and bias adjustment: compare the actual (calculated) output with the desired (target) output.
If $y \ne t$, then
$$w_i(\text{new}) = w_i(\text{old}) + \alpha t x_i, \qquad b(\text{new}) = b(\text{old}) + \alpha t$$
else
$$w_i(\text{new}) = w_i(\text{old}), \qquad b(\text{new}) = b(\text{old})$$

Step 6: Train the network until there is no weight change. This is the stopping condition for the network. If this condition is not met, then start again from Step 2.


• Notes:
• Learning occurs only when a sample has y != t.
• Otherwise, it is similar to the Hebb learning rule.
• There are two loops; a completion of the inner loop (each sample is used once) is called an epoch.
• Stop condition:
   • when no weight is changed in the current epoch, or
   • when a pre-determined number of epochs is reached.
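Steps 0 to 6 can be sketched as a short training loop. The sketch below is illustrative only: the function name `train_perceptron`, the `max_epochs` cap and the use of the bipolar AND data as a demonstration are assumptions, not part of the algorithm statement.

```python
def train_perceptron(samples, alpha=1.0, theta=0.0, max_epochs=100):
    """Train a single-output perceptron.

    samples: list of (input_vector, target) pairs with targets +1 or -1.
    Returns (weights, bias, epochs_used).
    """
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0                      # Step 0: zero weights and bias
    for epoch in range(1, max_epochs + 1):     # Step 1: outer loop over epochs
        changed = False
        for x, t in samples:                   # Step 2: each training pair s:t
            y_in = b + sum(xi * wi for xi, wi in zip(x, w))        # Step 4
            y = 1 if y_in > theta else (-1 if y_in < -theta else 0)
            if y != t:                         # Step 5: update only on error
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                changed = True
        if not changed:                        # Step 6: no weight change
            return w, b, epoch
    return w, b, max_epochs


# Bipolar AND function used as a small demonstration data set.
and_samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
weights, bias, epochs = train_perceptron(and_samples)
print(weights, bias, epochs)
```

The second stop condition from the notes (a pre-determined number of epochs) is what `max_epochs` models; it guards against data that is not linearly separable.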


PERCEPTRON TRAINING ALGORITHM FOR 
MULTIPLE OUTPUT CLASS 


Step 0: Initialize the weights and bias (to zero). Also initialize the learning rate α (0 < α <= 1). For simplicity, α is set to 1.

Step 1: Perform Steps 2-6 while the stopping condition is false.
Step 2: Perform Steps 3-5 for each training pair, indicated by s:t.
Step 3: Set the activation of each input unit, for i = 1 to n:
$$x_i = s_i$$
Step 4: Calculate the output response of each output unit, for j = 1 to m. To do so, first obtain the net input
$$y_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij}$$
where n is the number of input neurons. Then apply the activation function over the net input to get the output:
$$y_j = f(y_{inj}) = \begin{cases} 1 & \text{if } y_{inj} > \theta \\ 0 & \text{if } -\theta \le y_{inj} \le \theta \\ -1 & \text{if } y_{inj} < -\theta \end{cases}$$

Step 5: Weight and bias adjustment, for j = 1 to m and i = 1 to n.
If $y_j \ne t_j$, then
$$w_{ij}(\text{new}) = w_{ij}(\text{old}) + \alpha t_j x_i, \qquad b_j(\text{new}) = b_j(\text{old}) + \alpha t_j$$
else
$$w_{ij}(\text{new}) = w_{ij}(\text{old}), \qquad b_j(\text{new}) = b_j(\text{old})$$

Step 6: Train the network until there is no weight change. This is the stopping condition for the network. If this condition is not met, then start again from Step 2.
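The multiple-output version differs only in that every output unit j keeps its own weight column and bias. A minimal sketch follows; the names `train_multi` and `predict`, the weight layout `w[i][j]` and the choice of AND and OR as the two target columns are assumptions made for illustration.

```python
def step(y_in, theta=0.0):
    """Three-level activation with threshold theta."""
    return 1 if y_in > theta else (-1 if y_in < -theta else 0)


def train_multi(samples, m, alpha=1.0, theta=0.0, max_epochs=100):
    """samples: list of (x, t) where t is a tuple of m bipolar targets.

    w[i][j] is the weight from input unit i to output unit j.
    """
    n = len(samples[0][0])
    w = [[0.0] * m for _ in range(n)]
    b = [0.0] * m
    for _ in range(max_epochs):
        changed = False
        for x, t in samples:
            for j in range(m):                      # each output unit in turn
                y_in = b[j] + sum(x[i] * w[i][j] for i in range(n))
                if step(y_in, theta) != t[j]:       # update only unit j
                    for i in range(n):
                        w[i][j] += alpha * t[j] * x[i]
                    b[j] += alpha * t[j]
                    changed = True
        if not changed:
            break
    return w, b


def predict(x, w, b, theta=0.0):
    """Response of every output unit for input x."""
    return [step(b[j] + sum(x[i] * w[i][j] for i in range(len(x))), theta)
            for j in range(len(b))]


# Two output units trained at once: column 0 is AND, column 1 is OR (bipolar).
data = [((1, 1), (1, 1)), ((1, -1), (-1, 1)),
        ((-1, 1), (-1, 1)), ((-1, -1), (-1, -1))]
w, b = train_multi(data, m=2)
print(w, b)
```

Because each output unit only reads its own column of `w`, the units learn independently of one another.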


PERCEPTRON NETWORK TESTING ALGORITHM 


Step 0: The initial weights used here are taken from the training algorithm (the final weights obtained during training).

Step 1: For each input vector X to be classified, perform Steps 2-3.
Step 2: Set the activations of the input units.
Step 3: Obtain the response of the output unit:
$$y_{in} = b + \sum_{i=1}^{n} x_i w_i$$
$$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$$


✓ The perceptron network can be used to illustrate the concept of linear separability.

✓ Here the separating line may be based on the value of the threshold.

✓ The condition for separating the response from the region of positive to the region of zero is
$$w_1 x_1 + w_2 x_2 + b > \theta$$

✓ The condition for separating the response from the region of zero to the region of negative is
$$w_1 x_1 + w_2 x_2 + b < -\theta$$

✓ The conditions above are stated for a single-layer perceptron network with two input neurons, one output neuron and one bias.
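The two conditions split the input plane into a positive region, a negative region and a band of width set by θ in between. The sketch below illustrates this; the helper names `region` and `boundary_x2` and the sample weights (w1 = w2 = 1, b = -1, θ = 0.2) are hypothetical values picked for the example.

```python
def region(x1, x2, w1, w2, b, theta):
    """Classify a point by the two separating conditions."""
    s = w1 * x1 + w2 * x2 + b
    if s > theta:
        return "positive"
    if s < -theta:
        return "negative"
    return "zero"          # inside the undecided band


def boundary_x2(x1, w1, w2, b, theta):
    """x2 on the line w1*x1 + w2*x2 + b = theta (requires w2 != 0)."""
    return (theta - b - w1 * x1) / w2


# Hypothetical weights, chosen only to illustrate the three regions.
w1, w2, b, theta = 1.0, 1.0, -1.0, 0.2
for point in [(1, 1), (0.5, 0.5), (-1, -1)]:
    print(point, region(point[0], point[1], w1, w2, b, theta))
```

Setting θ = 0 collapses the band onto a single separating line, which is the case used in the worked AND example.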


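Implement AND function using perceptron network for bipolar inputs and targets

 x1   x2    t
  1    1    1
  1   -1   -1
 -1    1   -1
 -1   -1   -1

• For the first input pattern, x1 = 1, x2 = 1 and t = 1, with weights and bias w1 = 0, w2 = 0 and b = 0 (learning rate α = 1).

• Calculate the net input:
$$y_{in} = b + x_1 w_1 + x_2 w_2 = 0 + 1 \times 0 + 1 \times 0 = 0$$

• Apply the activation function (threshold θ = 0):
$$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > 0 \\ 0 & \text{if } y_{in} = 0 \\ -1 & \text{if } y_{in} < 0 \end{cases}$$
so y = 0.

• Check whether t = y. Here t = 1 and y = 0, so the weights are changed by
$$\Delta w_1 = \alpha t x_1, \qquad \Delta w_2 = \alpha t x_2, \qquad \Delta b = \alpha t$$

$$w_1(\text{new}) = w_1(\text{old}) + \alpha t x_1 = 0 + 1 \times 1 \times 1 = 1$$
$$w_2(\text{new}) = w_2(\text{old}) + \alpha t x_2 = 0 + 1 \times 1 \times 1 = 1$$
$$b(\text{new}) = b(\text{old}) + \alpha t = 0 + 1 \times 1 = 1$$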

Input       Target   Net input   Calculated    Weight changes      Weights
x1   x2     (t)      (y_in)      output (y)    Δw1   Δw2   Δb      w1   w2   b
                                                                   (0    0    0)
EPOCH-1
 1    1      1        0           0             1     1     1       1    1    1
 1   -1     -1        1           1            -1     1    -1       0    2    0
-1    1     -1        2           1             1    -1    -1       1    1   -1
-1   -1     -1       -3          -1             0     0     0       1    1   -1
EPOCH-2
 1    1      1        1           1             0     0     0       1    1   -1
 1   -1     -1       -1          -1             0     0     0       1    1   -1
-1    1     -1       -1          -1             0     0     0       1    1   -1
-1   -1     -1       -3          -1             0     0     0       1    1   -1
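The table can be reproduced row by row with a short trace. This is a sketch for checking the arithmetic, assuming α = 1 and θ = 0 as in the worked example; the function name `trace_and` and the tuple layout of each row are choices made here.

```python
def step(y_in):
    """Activation with threshold 0: returns 1, 0 or -1."""
    return 1 if y_in > 0 else (-1 if y_in < 0 else 0)


def trace_and(epochs=2, alpha=1.0):
    """Return one tuple per table row:
    (epoch, x1, x2, t, y_in, y, dw1, dw2, db, w1, w2, b)."""
    samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
    w1 = w2 = b = 0.0
    rows = []
    for epoch in range(1, epochs + 1):
        for (x1, x2), t in samples:
            y_in = b + x1 * w1 + x2 * w2
            y = step(y_in)
            if y != t:                      # error: apply the learning rule
                dw1, dw2, db = alpha * t * x1, alpha * t * x2, alpha * t
            else:                           # no error: no weight change
                dw1 = dw2 = db = 0.0
            w1, w2, b = w1 + dw1, w2 + dw2, b + db
            rows.append((epoch, x1, x2, t, y_in, y, dw1, dw2, db, w1, w2, b))
    return rows


for row in trace_and():
    print(row)
```

The second epoch produces no weight changes, so training stops with w1 = 1, w2 = 1 and b = -1.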


• Find the weights required to perform the following classification using a perceptron network.
• The vectors (1, 1, 1, 1) and (-1, 1, -1, -1) belong to class 1; the vectors (1, 1, 1, -1) and (1, -1, -1, 1) belong to class -1.
• Assume a learning rate of 1.
• Initial weights = 0.


Input                       Target
x1   x2   x3   x4   b       (t)
 1    1    1    1   1        1
-1    1   -1   -1   1        1
 1    1    1   -1   1       -1
 1   -1   -1    1   1       -1


$$y_{in} = b + x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4$$

$$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > 0 \\ 0 & \text{if } y_{in} = 0 \\ -1 & \text{if } y_{in} < 0 \end{cases}$$

$$\Delta w_i = \alpha t x_i \ (i = 1 \text{ to } 4), \qquad \Delta b = \alpha t$$



Implement the OR function using a perceptron network for binary inputs and bipolar targets, up to 3 epochs.


Find the weights using a perceptron network for the ANDNOT function when all the inputs are presented only one time. Use bipolar inputs and targets.


Find the weights required to perform the following classification using a perceptron network. The vectors (1,1,1,1) and (-1,1,-1,-1) belong to the class and so have the target value 1; the vectors (1,1,1,-1) and (1,-1,-1,1) do not belong to the class and so have the target value -1. Assume a learning rate of 1 and initial weights of 0.


Using the perceptron rule, find the weights required to perform the following classification of the given input patterns shown below.


The patterns are shown in 3×3 matrix form in the squares. The "+" symbol represents the value 1 and an empty square indicates -1.

Consider that "I" belongs to the class, so its target value is 1, and "F" does not belong to the class, so its target value is -1.


TUTORIAL QUESTIONS 


Implement the NOR function using a perceptron network for bipolar inputs and targets.


Find the weights required to perform the following classification using a perceptron network. The vectors (1,1,-1,-1) and (1,-1,1,-1) belong to the class and so have the target value 1; the vectors (-1,-1,-1,1) and (-1,-1,1,1) do not belong to the class and so have the target value -1. Assume a learning rate of 1 and initial weights of 0.


Classify the 2D pattern shown in the figure below using a perceptron network.


ADAPTIVE LINEAR NEURON (ADALINE) 


In 1959, Bernard Widrow and Marcian Hoff of Stanford 
developed models they called ADALINE (Adaptive Linear 
Neuron) and MADALINE (Multilayer ADALINE). These models 
were named for their use of Multiple ADAptive LiNear 
Elements. MADALINE was the first neural network to be 
applied to a real world problem. It is an adaptive filter which 
eliminates echoes on phone lines. 




• Units with a linear activation function are called linear units.

• A network with a single linear unit is called an Adaline (adaptive linear neuron).

• That is, in an Adaline, the input-output relationship is linear.

• Adaline uses bipolar activation for its input signals and its target output.

• The weights between the input and the output are adjustable.

• The bias in Adaline acts like an adjustable weight whose connection is from a unit whose activation is always 1.


• Adaline is a net which has only one output unit.
• The Adaline network may be trained using the delta rule.

• The delta rule is also called the least mean square (LMS) rule or Widrow-Hoff rule.

• This learning rule is found to minimize the mean squared error between the activation and the target value.


Delta Rule for Single Output Unit 


• The Widrow-Hoff rule is very similar to the perceptron learning rule. However, their origins are different.

• The perceptron learning rule originates
> from the Hebbian assumption.

• The delta rule is derived
> from the gradient-descent method (it can be generalized to more than one layer). Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost). Gradient descent is best used when the parameters cannot be calculated analytically.

• Also, the perceptron learning rule stops after a finite number of learning steps, but the gradient-descent approach continues forever, converging only asymptotically to the solution.


ADALINE LEARNING RULE 


The Adaline network uses the delta learning rule. This rule is also called the Widrow-Hoff learning rule or least mean square (LMS) rule. The delta rule for adjusting the weights is given as (i = 1 to n):

$$\Delta w_i = \alpha (t - y_{in}) x_i$$

where
Δw_i = weight change
α = learning rate
x = vector of activations of the input units
y_in = net input to the output unit, i.e. $y_{in} = \sum_{i=1}^{n} x_i w_i$
t = target output


✓ Similar to the perceptron rule.
✓ The perceptron rule originates from the Hebbian assumption.
✓ The delta rule is derived from the gradient-descent method.
✓ The perceptron rule stops after a finite number of learning steps.
✓ The delta rule continues forever, converging only asymptotically to the solution.
✓ The delta rule updates the weights on the connections so as to minimize the difference between the net input to the output unit and the target value.
✓ The major aim is to minimize the error over all training patterns.


Perceptron rule                              Delta rule
Originates from the Hebbian assumption       Derived from the gradient-descent method
Stops after a finite number of learning      Continues forever, converging
steps                                        asymptotically to the solution
                                             Minimizes error over all training patterns


ARCHITECTURE 


✓ Adaline is a single-unit neuron, which receives input from several units and also from one unit called the bias.

✓ The basic Adaline model consists of trainable weights.

✓ Inputs take either of 2 values (+1 or -1) and the weights have signs (positive or negative).
✓ Initially, random weights are assigned.

✓ The net input calculated is applied to a quantizer transfer function that restores the output to +1 or -1.

✓ The Adaline model compares the actual output with the target output and, on the basis of the training algorithm, the weights are adjusted.


[Figure: Adaline model, with the bias input x0 = 1 and a learning supervisor that adjusts the weights.]


ADALINE TRAINING ALGORITHM


Step 0: Set the weights and bias to some random values other than zero. Set the learning rate α.
Step 1: Perform Steps 2-6 while the stopping condition is false.
Step 2: Perform Steps 3-5 for each bipolar training pair s:t.
Step 3: Set the activations of the input units, for i = 1 to n:
$$x_i = s_i$$
Step 4: Calculate the net input to the output unit:
$$y_{in} = b + \sum_{i=1}^{n} x_i w_i$$
Step 5: Update the weights and bias, for i = 1 to n:
$$w_i(\text{new}) = w_i(\text{old}) + \alpha (t - y_{in}) x_i$$
$$b(\text{new}) = b(\text{old}) + \alpha (t - y_{in})$$

Step 6: If the highest weight change that occurred during training is smaller than a specified
tolerance, then stop the training; else continue. (Stopping condition)
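The Adaline steps can be sketched as follows. This is an illustration under stated assumptions: the function name `train_adaline`, the initial weight range, the tolerance, the `max_epochs` cap and the fixed random seed are not given in the algorithm and are chosen here for reproducibility.

```python
import random


def train_adaline(samples, alpha=0.1, tol=1e-3, max_epochs=100, seed=0):
    """Delta-rule training; returns (weights, bias, epochs_used)."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    # Step 0: small random non-zero starting values (assumed range).
    w = [rng.uniform(0.05, 0.2) for _ in range(n)]
    b = rng.uniform(0.05, 0.2)
    epoch = 0
    for epoch in range(1, max_epochs + 1):          # Step 1
        largest = 0.0
        for x, t in samples:                        # Steps 2-3
            y_in = b + sum(xi * wi for xi, wi in zip(x, w))   # Step 4
            err = t - y_in
            for i in range(n):                      # Step 5: delta rule
                dw = alpha * err * x[i]
                w[i] += dw
                largest = max(largest, abs(dw))
            b += alpha * err
            largest = max(largest, abs(alpha * err))
        if largest < tol:                           # Step 6: stopping test
            break
    return w, b, epoch


# Bipolar OR function as demonstration data.
or_samples = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
w, b, epochs = train_adaline(or_samples)
print([round(v, 3) for v in w], round(b, 3), epochs)
```

Because the delta rule uses the net input rather than the quantized output, the weights keep moving whenever the net input differs from the target, so on data like this the tolerance test may never trigger and the epoch cap ends training instead.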


TESTING ALGORITHM 


• Step 0: Initialize the weights (taken from the training algorithm).

• Step 1: Perform Steps 2-4 for each bipolar input vector x.
• Step 2: Set the activations of the input units to x.

• Step 3: Calculate the net input:
$$y_{in} = b + \sum_{i=1}^{n} x_i w_i$$

• Step 4: Apply the activation function over the net input calculated:
$$y = \begin{cases} 1 & \text{if } y_{in} \ge 0 \\ -1 & \text{if } y_{in} < 0 \end{cases}$$
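The testing procedure is a single forward pass followed by the quantizer. A minimal sketch is given below; the function name `adaline_predict` and the weights w = [0.5, 0.5], b = 0.5 are hypothetical values used only to exercise the function, not weights derived in the text.

```python
def adaline_predict(w, b, x):
    """Steps 2-4: net input followed by the bipolar quantizer."""
    y_in = b + sum(xi * wi for xi, wi in zip(x, w))
    return 1 if y_in >= 0 else -1


# Hypothetical trained weights for a bipolar OR-like classifier.
w, b = [0.5, 0.5], 0.5
for x in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x, adaline_predict(w, b, x))
```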


Design ADALINE for OR Function 


• Initially, all weights are assumed to be small random values, say 0.1, and the learning rate is set to 0.1.

• Also, set the least squared error to 2.

• The weights are updated as long as the total error is greater than the least squared error.


x1   x2    t
 1    1    1
 1   -1    1
-1    1    1
-1   -1   -1


• Calculate the net input for the first pattern (x1 = 1, x2 = 1, t = 1):
$$y_{in} = b + \sum_i x_i w_i = 0.1 + 0.1 \times 1 + 0.1 \times 1 = 0.3$$

• Now compute (t - y_in) = (1 - 0.3) = 0.7.
• Now update the weights and bias:
$$w_i(\text{new}) = w_i(\text{old}) + \alpha (t - y_{in}) x_i$$
$$w_1(\text{new}) = 0.1 + 0.1(1 - 0.3)(1) = 0.17$$
$$w_2(\text{new}) = 0.1 + 0.1(1 - 0.3)(1) = 0.17$$
$$b(\text{new}) = b(\text{old}) + \alpha (t - y_{in}) = 0.1 + 0.1(1 - 0.3) = 0.17$$

• Calculate the error:
$$\text{error} = (t - y_{in})^2 = 0.7^2 = 0.49$$

• Similarly, repeat the same steps for the other input vectors to obtain the following values.


x1   x2    t    y_in     (t - y_in)   Δw1       Δw2       Δb        w1 (0.1)   w2 (0.1)   b (0.1)
 1    1    1    0.3       0.7          0.07      0.07      0.07     0.17       0.17       0.17
 1   -1    1    0.17      0.83         0.083    -0.083     0.083    0.253      0.087      0.253
-1    1    1    0.087     0.913       -0.0913    0.0913    0.0913   0.1617     0.1783     0.3443
-1   -1   -1    0.0043   -1.0043       0.1004    0.1004   -0.1004   0.2621     0.2787     0.2439
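The first epoch of this example can be checked numerically. The sketch below assumes the starting values stated in the example (all weights and bias 0.1, learning rate 0.1); the function name `trace_or_epoch` and the row layout are choices made here.

```python
def trace_or_epoch(w1=0.1, w2=0.1, b=0.1, alpha=0.1):
    """Run one epoch of the delta rule on the bipolar OR data.

    Returns (rows, total_squared_error) where each row is
    (x1, x2, t, y_in, t - y_in, dw1, dw2, db, w1, w2, b).
    """
    samples = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
    rows, total = [], 0.0
    for (x1, x2), t in samples:
        y_in = b + x1 * w1 + x2 * w2
        err = t - y_in
        dw1, dw2, db = alpha * err * x1, alpha * err * x2, alpha * err
        w1, w2, b = w1 + dw1, w2 + dw2, b + db
        total += err ** 2
        rows.append((x1, x2, t, y_in, err, dw1, dw2, db, w1, w2, b))
    return rows, total


rows, total = trace_or_epoch()
for r in rows:
    print([round(v, 4) for v in r])
print("total squared error:", round(total, 2))
```

The total squared error after this epoch is still above the chosen limit of 2, so a further epoch would be needed under the stated stopping rule.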


TUTORIAL QUESTIONS 


Use an ADALINE network to train the ANDNOT function with bipolar inputs and targets. Perform 2 epochs of training.

Implement the AND function using ADALINE.


Using the delta rule, find the weights required to perform the following classification: vectors (1,1,-1,-1) and (-1,-1,-1,-1) belong to the class having target value 1; vectors (1,1,1,1) and (-1,-1,1,-1) do not belong to the class and have target value -1. Use a learning rate of 0.5 and assume random initial weights. Also test the network.


Back-Propagation Network 


• The back-propagation learning algorithm is one of the most important developments in neural networks (Bryson and Ho, 1969; Werbos, 1974; LeCun, 1985; Parker, 1985; Rumelhart, 1986).

• This network has reawakened the scientific and engineering community to the modeling and processing of numerous phenomena using neural networks.

• This learning algorithm is applied to multilayer feed-forward networks consisting of processing elements with continuous, differentiable activation functions.

• The networks associated with the back-propagation learning algorithm are also called back-propagation networks (BPNs).


• For a given set of training input-output pairs, this algorithm provides a procedure for changing the weights in a BPN to classify the given input patterns correctly.

• The basic concept for this weight update algorithm is simply the gradient-descent method, as used in the case of simple perceptron networks with differentiable units.

• This is a method where the error is propagated back to the hidden units.
• The aim of the neural network is
> to train the net to achieve a balance between the net's ability to respond (memorization) and its ability to give reasonable responses to input that is similar but not identical to that used in training (generalization).

• The back-propagation algorithm differs from other networks in the process by which the weights are calculated during the learning period of the network.

• The general difficulty with the multilayer perceptron is calculating the weights of the hidden layers in an efficient way that results in a very small or zero output error.

• When the number of hidden layers is increased, network training becomes more complex.

• To update the weights, the error must be calculated. The error, which is the difference between the actual (calculated) and the desired (target) output, is easily measured at the output layer.

• At the hidden layers, however, there is no direct information about the error.

• Therefore, other techniques must be used to calculate an error at the hidden layer that will minimize the output error, which is the ultimate goal.

• The training of the BPN is done in three stages:
> the feed-forward of the input training pattern,
> the calculation and back-propagation of the error,
> the updating of the weights.


Back-Propagation Network-Architecture 

• A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a hidden layer and an output layer.

• The neurons present in the hidden and output layers have biases, which are the connections from units whose activation is always 1.

• The bias terms also act as weights.

• The figure shows the architecture of a BPN, depicting only the direction of information flow for the feed-forward phase.

• During the back-propagation phase of learning, signals are sent in the reverse direction.

• The inputs sent to the BPN and the outputs obtained from the net can be either binary (0, 1) or bipolar (-1, +1).

• The activation function can be any function which increases monotonically and is also differentiable.


[Figure: Architecture of a back-propagation network.]

The notation used for the back-propagation network is:

x = input training vector $(x_1, \ldots, x_i, \ldots, x_n)$
t = target output vector $(t_1, \ldots, t_k, \ldots, t_m)$
α = learning rate parameter
$x_i$ = input unit i (since the input layer uses the identity activation function, the input and output signals here are the same)
$v_{0j}$ = bias on the jth hidden unit
$w_{0k}$ = bias on the kth output unit
$z_j$ = hidden unit j. The net input to $z_j$ is
$$z_{inj} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$$
and the output is
$$z_j = f(z_{inj})$$
$y_k$ = output unit k. The net input to $y_k$ is
$$y_{ink} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}$$
and the output is
$$y_k = f(y_{ink})$$
$\delta_k$ = error correction weight adjustment for $w_{jk}$ that is due to an error at output unit $y_k$, which is back-propagated to the hidden units that feed into unit $y_k$
$\delta_j$ = error correction weight adjustment for $v_{ij}$ that is due to the back-propagation of error to the hidden unit $z_j$


[Flowchart: BPN training. Initialize the weights to some random values; for each training pair: receive the input signal xi and transmit it to the hidden units; in each hidden unit calculate the output zj; send zj to the output layer units; calculate the output signal from the output layer, y_ink = w0k + Σ zj wjk and yk = f(y_ink); the target tk enters; compute the error correction factors for the output and hidden units; update the weights and biases on the output units, wjk(new) = wjk(old) + Δwjk; update the weights and biases on the hidden units, vij(new) = vij(old) + Δvij; stop if the specified number of epochs is reached or tk = yk.]


Training Algorithm 


• Step 0: Initialize the weights and learning rate (take some small random values).
• Step 1: Perform Steps 2-9 while the stopping condition is false.
• Step 2: Perform Steps 3-8 for each training pair.

Feed-forward phase (Phase I)

Step 3: Each input unit receives the input signal $x_i$ and sends it to the hidden units (i = 1 to n).
Step 4: Each hidden unit $z_j$ (j = 1 to p) sums its weighted input signals to calculate the net input:
$$z_{inj} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$$
Calculate the output of the hidden unit by applying its activation function over $z_{inj}$ (binary or bipolar sigmoidal activation function):
$$z_j = f(z_{inj})$$
and send the output signal from the hidden unit to the input of the output layer units.
Step 5: For each output unit $y_k$ (k = 1 to m), calculate the net input:
$$y_{ink} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}$$
and apply the activation function to compute the output signal:
$$y_k = f(y_{ink})$$

Back-propagation of error (Phase II)

Step 6: Each output unit $y_k$ (k = 1 to m) receives a target pattern corresponding to the input training pattern and computes the error correction term:
$$\delta_k = (t_k - y_k) f'(y_{ink})$$
The derivative $f'(y_{ink})$ can be calculated as in Section 2.3.3. On the basis of the calculated error correction term, update the change in weights and bias:
$$\Delta w_{jk} = \alpha \delta_k z_j, \qquad \Delta w_{0k} = \alpha \delta_k$$
Also, send $\delta_k$ backwards to the hidden layer.
Step 7: Each hidden unit ($z_j$, j = 1 to p) sums its delta inputs from the output units:
$$\delta_{inj} = \sum_{k=1}^{m} \delta_k w_{jk}$$
The term $\delta_{inj}$ is multiplied by the derivative of $f(z_{inj})$ to calculate the error term:
$$\delta_j = \delta_{inj} f'(z_{inj})$$
The derivative $f'(z_{inj})$ can be calculated depending on whether the binary or bipolar sigmoidal function is used. On the basis of the calculated $\delta_j$, update the change in weights and bias:
$$\Delta v_{ij} = \alpha \delta_j x_i, \qquad \Delta v_{0j} = \alpha \delta_j$$

Weight and bias updating (Phase III)

Step 8: Each output unit ($y_k$, k = 1 to m) updates its bias and weights:
$$w_{jk}(\text{new}) = w_{jk}(\text{old}) + \Delta w_{jk}$$
$$w_{0k}(\text{new}) = w_{0k}(\text{old}) + \Delta w_{0k}$$
Each hidden unit ($z_j$, j = 1 to p) updates its bias and weights:
$$v_{ij}(\text{new}) = v_{ij}(\text{old}) + \Delta v_{ij}$$
$$v_{0j}(\text{new}) = v_{0j}(\text{old}) + \Delta v_{0j}$$

Step 9: Check for the stopping condition. The stopping condition may be that a certain number of epochs has been reached, or that the actual output equals the target output.
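Steps 3 to 8 for a single training pair can be sketched in plain Python. This is an illustrative sketch, assuming the binary sigmoid as the activation (so that f'(x) = f(x)[1 - f(x)]) and one hidden layer; the function names `forward` and `train_step`, the list-of-lists weight layout and the sample numbers are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Binary sigmoidal activation."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, v, v0, w, w0):
    """Feed-forward phase. v[i][j]: input i -> hidden j; w[j][k]: hidden j -> output k."""
    z = [sigmoid(v0[j] + sum(x[i] * v[i][j] for i in range(len(x))))
         for j in range(len(v0))]
    y = [sigmoid(w0[k] + sum(z[j] * w[j][k] for j in range(len(z))))
         for k in range(len(w0))]
    return z, y


def train_step(x, t, v, v0, w, w0, alpha):
    """One pass of Steps 3-8; updates the weights in place, returns the old output."""
    z, y = forward(x, v, v0, w, w0)                                   # Steps 3-5
    dk = [(t[k] - y[k]) * y[k] * (1 - y[k]) for k in range(len(y))]   # Step 6
    # Step 7: hidden error terms, computed with the weights before updating.
    dj = [sum(dk[k] * w[j][k] for k in range(len(y))) * z[j] * (1 - z[j])
          for j in range(len(z))]
    for j in range(len(z)):                                           # Step 8
        for k in range(len(y)):
            w[j][k] += alpha * dk[k] * z[j]
    for k in range(len(y)):
        w0[k] += alpha * dk[k]
    for i in range(len(x)):
        for j in range(len(z)):
            v[i][j] += alpha * dj[j] * x[i]
    for j in range(len(z)):
        v0[j] += alpha * dj[j]
    return y


# Small 2-2-1 network with illustrative starting weights.
v = [[0.6, -0.3], [-0.1, 0.4]]
v0 = [0.3, 0.5]
w = [[0.4], [0.1]]
w0 = [-0.2]
before = train_step([0, 1], [1], v, v0, w, w0, alpha=0.25)
_, after = forward([0, 1], v, v0, w, w0)
print(before, after)
```

Computing all error terms before any weight is changed matters: the hidden-layer deltas must use the output weights as they were during the forward pass.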


Weight and bias adjustment

> Each output unit ($y_k$, k = 1 to m) updates its bias and weights:
$$w_{jk}(\text{new}) = w_{jk}(\text{old}) + \Delta w_{jk} = w_{jk}(\text{old}) + \alpha \delta_k z_j$$
$$w_{0k}(\text{new}) = w_{0k}(\text{old}) + \Delta w_{0k} = w_{0k}(\text{old}) + \alpha \delta_k$$

> Each hidden unit ($z_j$, j = 1 to p) updates its bias and weights:
$$v_{ij}(\text{new}) = v_{ij}(\text{old}) + \Delta v_{ij} = v_{ij}(\text{old}) + \alpha \delta_j x_i$$
$$v_{0j}(\text{new}) = v_{0j}(\text{old}) + \Delta v_{0j} = v_{0j}(\text{old}) + \alpha \delta_j$$


• Calculate the net input. For the z1 unit:
$$z_{in1} = v_{01} + x_1 v_{11} + x_2 v_{21} = 0.3 + 0 \times 0.6 + 1 \times (-0.1) = 0.2$$
For the z2 unit:
$$z_{in2} = v_{02} + x_1 v_{12} + x_2 v_{22} = 0.5 + 0 \times (-0.3) + 1 \times 0.4 = 0.9$$

• The activation function used is the binary sigmoidal activation function:
$$f(x) = \frac{1}{1 + e^{-x}}$$

Applying the activation to calculate the outputs, we obtain
$$z_1 = f(z_{in1}) = \frac{1}{1 + e^{-0.2}} = 0.5498$$
$$z_2 = f(z_{in2}) = \frac{1}{1 + e^{-0.9}} = 0.7109$$

Calculate the net input entering the output layer. For the y unit:
$$y_{in} = w_0 + z_1 w_1 + z_2 w_2 = -0.2 + 0.5498 \times 0.4 + 0.7109 \times 0.1 = 0.09101$$

Applying the activation to calculate the output, we obtain
$$y = f(y_{in}) = \frac{1}{1 + e^{-0.09101}} = 0.5227$$


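Compute the error portion $\delta_k$:
$$\delta_k = (t_k - y_k) f'(y_{ink})$$
$$f'(y_{in}) = f(y_{in})[1 - f(y_{in})] = 0.5227[1 - 0.5227] = 0.2495$$
With the target t = 1, this gives
$$\delta_1 = (1 - 0.5227)(0.2495) = 0.1191$$

Find the changes in weights between the hidden and output layer:
$$\Delta w_1 = \alpha \delta_1 z_1 = 0.25 \times 0.1191 \times 0.5498 = 0.0164$$
$$\Delta w_2 = \alpha \delta_1 z_2 = 0.25 \times 0.1191 \times 0.7109 = 0.02117$$
$$\Delta w_0 = \alpha \delta_1 = 0.25 \times 0.1191 = 0.02978$$

Compute the error portion $\delta_j$ between the input and hidden layer (j = 1 to 2):
$$\delta_j = \delta_{inj} f'(z_{inj})$$
$$\delta_{inj} = \sum_{k=1}^{m} \delta_k w_{jk}$$
$$\delta_{inj} = \delta_1 w_{j1} \quad [\text{only one output neuron}]$$
$$\delta_{in1} = \delta_1 w_{11} = 0.1191 \times 0.4 = 0.04764$$
$$\delta_{in2} = \delta_1 w_{21} = 0.1191 \times 0.1 = 0.01191$$

Error for hidden unit 1 (from here on, $\delta_1$ and $\delta_2$ denote the hidden-unit error terms):
$$f'(z_{in1}) = f(z_{in1})[1 - f(z_{in1})] = 0.5498[1 - 0.5498] = 0.2475$$
$$\delta_1 = \delta_{in1} f'(z_{in1}) = 0.04764 \times 0.2475 = 0.0118$$

Error for hidden unit 2:
$$f'(z_{in2}) = f(z_{in2})[1 - f(z_{in2})] = 0.7109[1 - 0.7109] = 0.2055$$
$$\delta_2 = \delta_{in2} f'(z_{in2}) = 0.01191 \times 0.2055 = 0.00245$$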


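Find the changes in weights between the input and hidden layer:
$$\Delta v_{11} = \alpha \delta_1 x_1 = 0.25 \times 0.0118 \times 0 = 0$$
$$\Delta v_{21} = \alpha \delta_1 x_2 = 0.25 \times 0.0118 \times 1 = 0.00295$$
$$\Delta v_{01} = \alpha \delta_1 = 0.25 \times 0.0118 = 0.00295$$
$$\Delta v_{12} = \alpha \delta_2 x_1 = 0.25 \times 0.00245 \times 0 = 0$$
$$\Delta v_{22} = \alpha \delta_2 x_2 = 0.25 \times 0.00245 \times 1 = 0.0006125$$
$$\Delta v_{02} = \alpha \delta_2 = 0.25 \times 0.00245 = 0.0006125$$

Compute the final weights of the network:
$$v_{11}(\text{new}) = v_{11}(\text{old}) + \Delta v_{11} = 0.6 + 0 = 0.6$$
$$v_{12}(\text{new}) = v_{12}(\text{old}) + \Delta v_{12} = -0.3 + 0 = -0.3$$
$$v_{21}(\text{new}) = v_{21}(\text{old}) + \Delta v_{21} = -0.1 + 0.00295 = -0.09705$$
$$v_{22}(\text{new}) = v_{22}(\text{old}) + \Delta v_{22} = 0.4 + 0.0006125 = 0.4006125$$
$$w_1(\text{new}) = w_1(\text{old}) + \Delta w_1 = 0.4 + 0.0164 = 0.4164$$
$$w_2(\text{new}) = w_2(\text{old}) + \Delta w_2 = 0.1 + 0.02117 = 0.12117$$
$$v_{01}(\text{new}) = v_{01}(\text{old}) + \Delta v_{01} = 0.3 + 0.00295 = 0.30295$$
$$v_{02}(\text{new}) = v_{02}(\text{old}) + \Delta v_{02} = 0.5 + 0.0006125 = 0.5006125$$
$$w_0(\text{new}) = w_0(\text{old}) + \Delta w_0 = -0.2 + 0.02978 = -0.17022$$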
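The arithmetic of this worked example can be checked with a few lines of Python. The script below is a verification sketch: it takes the inputs (x1 = 0, x2 = 1), target 1, learning rate 0.25 and starting weights from the expressions in the example, and the variable names are chosen here for readability.

```python
import math


def f(x):
    """Binary sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


# Values read off the worked example.
x1, x2, t, alpha = 0.0, 1.0, 1.0, 0.25
v11, v21, v01 = 0.6, -0.1, 0.3      # weights into hidden unit z1
v12, v22, v02 = -0.3, 0.4, 0.5      # weights into hidden unit z2
w1, w2, w0 = 0.4, 0.1, -0.2         # weights into the output unit

# Feed-forward phase.
z1 = f(v01 + x1 * v11 + x2 * v21)
z2 = f(v02 + x1 * v12 + x2 * v22)
y = f(w0 + z1 * w1 + z2 * w2)

# Back-propagation of error.
delta_out = (t - y) * y * (1 - y)
delta_h1 = delta_out * w1 * z1 * (1 - z1)
delta_h2 = delta_out * w2 * z2 * (1 - z2)

# Weight and bias updates.
new = {
    "w1": w1 + alpha * delta_out * z1,
    "w2": w2 + alpha * delta_out * z2,
    "w0": w0 + alpha * delta_out,
    "v11": v11 + alpha * delta_h1 * x1,
    "v21": v21 + alpha * delta_h1 * x2,
    "v01": v01 + alpha * delta_h1,
    "v12": v12 + alpha * delta_h2 * x1,
    "v22": v22 + alpha * delta_h2 * x2,
    "v02": v02 + alpha * delta_h2,
}
for name, value in new.items():
    print(name, round(value, 5))
```

Small differences in the last digit relative to the hand calculation come from rounding the intermediate values to four decimals there.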


MADALINE NETWORK 


MADALINE is a Multilayer Adaptive Linear Element. MADALINE was the first neural network to be applied to a real-world problem. It is used in several adaptive filtering processes.


> Two or more Adalines are integrated to develop the Madaline model.

> Used for nonlinearly separable logic functions such as the XOR function.

> Used for adaptive noise cancellation and adaptive inverse control.

> In noise cancellation, the objective is to filter out an interference component by identifying a linear model of a measurable noise source and the corresponding immeasurable interference.

> Applications include ECG and echo elimination from long-distance telephone transmission lines.


ARCHITECTURE 


✓ A simple Madaline architecture consists of n units in the input layer, m units in the Adaline layer and one unit in the Madaline layer.

✓ Each neuron in the Adaline and Madaline layers has a bias with excitation 1.

✓ The Adaline layer is present between the input layer and the Madaline layer; hence the Adaline layer can be considered a hidden layer.

✓ The use of a hidden layer gives the net computational capability which is not found in a single-layer net.

✓ However, this complicates the training process.


TRAINING ALGORITHM 


[Flowchart: Madaline training. Initialize the fixed output weights and bias and the Adaline weights; set α (0 to 1); for each training pair, activate the input units (xi = si); find the net input to the hidden layer, z_inj = bj + Σ xi wij; calculate the outputs zj = f(z_inj); calculate the net input to the output unit, y_in = b0 + Σ zj vj; calculate the output y = f(y_in); if t != y and t = 1, update the weights on the unit zj whose net input is closest to zero; if t != y and t = -1, update the weights on the units zk that have positive net input; stop when there are no weight changes or the specified number of epochs is reached.]


✓ Only the weights between the input layer and the hidden (Adaline) layer are adjusted.

✓ The weights for the output layer are fixed.

✓ The weights v1, v2, ..., vm and the bias b0 that enter the output unit Y are determined so that the response of unit Y is 1.
✓ Thus the weights entering the Y unit may all be taken as equal:
v1 = v2 = ... = vm
✓ The bias can be taken as
b0 = 1/2
✓ The activation function of the Adaline and Madaline units is
$$f(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$$


> Step 0: Initialize the weights. The weights entering the output unit
are set (fixed). Set small random initial values for the Adaline weights. Also set
the initial learning rate α

> Step 1: While the stopping condition is false, perform Steps 2-7

> Step 2: For each bipolar training pair s:t, perform Steps 3-7

> Step 3: Activate the input layer units. For i = 1 to n, x_i = s_i

> Step 4: Calculate the net input to each hidden Adaline unit
   z_inj = b_j + Σ_i x_i w_ij,   j = 1 to m

> Step 5: Find the output of each hidden unit
   z_j = f(z_inj)

> Step 6: Find the output of the net
   y_in = b_0 + Σ_j z_j v_j
   y = f(y_in)

> Step 7: Calculate the error and update the weights
> A: if t = y, no weight update is required
> B: if t ≠ y and t = +1, update the weights on the unit z_j whose net input is closest to zero
   b_j(new) = b_j(old) + α(1 - z_inj)
   w_ij(new) = w_ij(old) + α(1 - z_inj) x_i
> C: if t ≠ y and t = -1, update the weights on all units z_k whose net input is positive
   b_k(new) = b_k(old) + α(-1 - z_ink)
   w_ik(new) = w_ik(old) + α(-1 - z_ink) x_i

> Step 8: Test for the stopping condition (if there is no weight change, or the
weights reach a satisfactory level, or a specified maximum
number of weight updates has been performed, then stop; otherwise
continue)
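
The Madaline training steps can be sketched in plain Python as below. This is only an illustrative sketch: the function names, the XOR data set, the starting Adaline weights and α = 0.5 are assumptions chosen for the example, and the output unit uses fixed weights and bias of 1/2.

```python
def step(x):
    """Bipolar step activation: +1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1


def forward(x, W, b, v, b0):
    """Steps 4-6: return (z_in, z, y) for one input vector x."""
    z_in = [b[j] + sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(b))]
    z = [step(s) for s in z_in]
    y = step(b0 + sum(zj * vj for zj, vj in zip(z, v)))
    return z_in, z, y


def train_madaline(samples, W, b, v, b0, alpha=0.5, max_epochs=100):
    """Only the Adaline weights W[i][j] and biases b[j] are changed."""
    for _ in range(max_epochs):
        changed = False
        for x, t in samples:
            z_in, _, y = forward(x, W, b, v, b0)
            if y == t:
                continue                      # Step 7 A: no update
            if t == 1:
                # Step 7 B: the unit whose net input is closest to zero
                units = [min(range(len(b)), key=lambda j: abs(z_in[j]))]
            else:
                # Step 7 C: every unit with positive net input
                units = [j for j in range(len(b)) if z_in[j] > 0]
            for j in units:
                delta = alpha * (t - z_in[j])  # t is +1 in case B, -1 in case C
                b[j] += delta
                for i in range(len(x)):
                    W[i][j] += delta * x[i]
                changed = True
        if not changed:                        # Step 8: no weight change
            break
    return W, b


# Illustrative run on bipolar XOR (all numbers here are assumed values).
xor = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
W = [[0.05, 0.1], [0.2, 0.2]]   # W[i][j]: input i -> Adaline j
b = [0.3, 0.15]
v, b0 = [0.5, 0.5], 0.5         # fixed output weights and bias
train_madaline(xor, W, b, v, b0)
preds = [forward(x, W, b, v, b0)[2] for x, _ in xor]
print(preds)
```

With other starting weights the procedure may need more epochs or may not settle, which is why the sketch also caps the number of epochs.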


> A training procedure which allows multilayer feed forward Neural 
Networks to be trained. 


> Can theoretically perform “any” input-output mapping. 


> Can learn to solve linearly inseparable problems. 


MULTILAYER FEEDFORWARD NETWORK 


[Figure: multilayer feedforward network diagram]


MULTILAYER FEEDFORWARD NETWORK: 
ACTIVATION AND TRAINING 


> For feed forward networks:
• A continuous activation function can be differentiated, allowing gradient descent.
• Back propagation is an example of a gradient-descent technique.
• Uses a sigmoid (binary or bipolar) activation function.


In multilayer networks, the activation function is
usually more complex than just a threshold function,
such as the binary sigmoid 1/[1+exp(-x)] or the bipolar sigmoid
2/[1+exp(-x)] - 1, which also allows for inhibition, etc.
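
As a small illustration, the two sigmoid forms mentioned above can be written as follows. The function names are chosen here for the example; the sketch does not guard against overflow for very large negative inputs.

```python
import math


def binary_sigmoid(x):
    """1 / (1 + exp(-x)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bipolar_sigmoid(x):
    """2 / (1 + exp(-x)) - 1; output lies in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


for x in (-2.0, 0.0, 2.0):
    print(x, binary_sigmoid(x), bipolar_sigmoid(x))
```

The bipolar form is just the binary form rescaled to (-1, 1), and it equals tanh(x/2).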


GRADIENT DESCENT 


> Gradient-Descent(training_examples, η)


> Each training example is a pair of the form <(x_1,...,x_n), t>, where
(x_1,...,x_n) is the vector of input values and t is the target output
value; η is the learning rate (e.g. 0.1)


> Initialize each w_i to some small random value
> Until the termination condition is met, Do

• Initialize each Δw_i to zero
• For each <(x_1,...,x_n), t> in training_examples Do

   ✓ Input the instance (x_1,...,x_n) to the linear unit and compute the
   output o

   ✓ For each linear unit weight w_i Do
      • Δw_i = Δw_i + η (t - o) x_i

• For each linear unit weight w_i Do
   • w_i = w_i + Δw_i
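
The Gradient-Descent procedure for a linear unit can be sketched as below. The weights start at zero instead of small random values so that the run is repeatable, and the three training examples (generated from t = -1 + 2 x_1, with a constant first input of 1 acting as the bias input) are assumed values for illustration.

```python
def gradient_descent(training_examples, eta=0.1, epochs=500):
    """Batch gradient descent for a single linear unit."""
    n = len(training_examples[0][0])
    w = [0.0] * n                        # assumption: zeros, not random values
    for _ in range(epochs):              # termination: fixed number of passes
        delta_w = [0.0] * n              # initialize each delta w_i to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
            for i in range(n):
                delta_w[i] += eta * (t - o) * x[i]
        for i in range(n):
            w[i] += delta_w[i]           # weights change once per pass
    return w


# Hypothetical data: x = (1, x1), target t = -1 + 2 * x1.
examples = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0), ((1.0, 2.0), 3.0)]
w = gradient_descent(examples)
print(w)
```

If η is chosen too large for the data, the same loop diverges instead of settling, so the learning rate here is tied to this small example.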


MODES OF GRADIENT DESCENT 


> Batch mode: gradient descent
w = w - η ∇E_D[w] over the entire data D
E_D[w] = 1/2 Σ_d (t_d - o_d)^2


> Incremental mode: gradient descent
w = w - η ∇E_d[w] over individual training examples d
E_d[w] = 1/2 (t_d - o_d)^2


> Incremental Gradient Descent can approximate Batch Gradient
Descent arbitrarily closely if η is small enough.
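
For comparison with batch mode, here is a minimal sketch of incremental mode for a linear unit, in which the weights change after every single example. The data, η = 0.05 and the zero starting weights are assumptions made for this example.

```python
def incremental_gradient_descent(training_examples, eta=0.05, epochs=1000):
    """Incremental (per-example) gradient descent for a linear unit."""
    n = len(training_examples[0][0])
    w = [0.0] * n                        # assumption: zeros, not random values
    for _ in range(epochs):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                # the update uses the error of this one example only
                w[i] += eta * (t - o) * x[i]
    return w


# Hypothetical data: x = (1, x1), target t = -1 + 2 * x1.
examples = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0), ((1.0, 2.0), 3.0)]
w_inc = incremental_gradient_descent(examples)
print(w_inc)
```

No accumulator Δw_i is needed here, which is the practical difference from batch mode.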


SIGMOID ACTIVATION FUNCTION 


net = Σ_{i=0..n} w_i x_i,   o = σ(net) = 1/(1 + e^(-net))


σ(x) is the sigmoid function: 1/(1 + e^(-x))
dσ(x)/dx = σ(x) (1 - σ(x))


Derive gradient descent rules to train:
• one sigmoid unit
  ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_i,d
• multilayer networks of sigmoid units:
  backpropagation
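
The derivative identity dσ(x)/dx = σ(x)(1 - σ(x)) can be checked numerically with a short sketch like the one below. The helper names and the step size h are assumptions for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Derivative written in terms of the function value itself."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(func, x, h=1e-6):
    """Central finite difference, used only as a cross-check."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```

Because the derivative reuses σ(x), no extra exponential has to be evaluated once the unit output is known.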


BACK PROPAGATION NETWORK 


• Backprop implements a gradient descent
search through the space of possible
network weights, iteratively reducing the
error E between training example target
values and the network outputs.

• Guaranteed to converge only towards some
local minimum.


Computer Science and Engineering, VJCET, 
BACKPROPAGATION


> Gradient descent over the entire network weight vector

> Easily generalized to arbitrary directed graphs

> Will find a local, not necessarily global, error minimum; in practice it
often works well (can be invoked multiple times with different initial
weights)

> Often includes a weight momentum term
Δw_i,j(t) = η δ_j x_i,j + α Δw_i,j(t-1)

> Minimizes error over the training examples
Will it generalize well to unseen instances (over-fitting)?

> Training can be slow, typically 1000-10000 iterations (Levenberg-Marquardt
can be used instead of gradient descent)
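
The momentum term can be illustrated on a one-weight problem. In the sketch below, the plain gradient step -η dE/dw plays the role of η δ_j x_i,j, and a fraction α of the previous weight change is added to it. The quadratic error E(w) = 1/2 (w - 3)^2, the starting weight and the values of η and α are assumptions for illustration.

```python
def momentum_descent(grad, w0, eta=0.1, alpha=0.9, steps=500):
    """Gradient descent with momentum: delta(t) = -eta*grad + alpha*delta(t-1)."""
    w, prev_delta = w0, 0.0
    for _ in range(steps):
        delta = -eta * grad(w) + alpha * prev_delta
        w += delta
        prev_delta = delta               # remembered for the next step
    return w


# Hypothetical error surface E(w) = 0.5 * (w - 3)^2, so dE/dw = w - 3.
w_final = momentum_descent(lambda w: w - 3.0, w0=0.0)
print(w_final)
```

Setting alpha to 0 recovers ordinary gradient descent, which makes the two easy to compare on the same problem.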


✓ Back propagation is a training method used for a
multilayer feed forward network.

✓ The network associated with the back propagation
learning algorithm is called the BACK PROPAGATION
NETWORK.

✓ It uses processing elements with continuous, differentiable
activation functions.

✓ It is also called the generalized delta rule.

✓ It is a gradient descent method which minimizes the
total squared error of the output computed by the
net.

✓ Any neural network is expected to respond correctly
to the input patterns that are used for training, which
is termed memorization, and it should respond
reasonably to input that is similar to but not the same
as the samples used for training, which is called
generalization.

✓ The training of a neural network by back
propagation takes place in three stages
✓ 1. Feed forward of the input pattern
✓ 2. Calculation and back propagation of the
associated error
✓ 3. Adjustment of the weights

✓ After the neural network is trained, the
neural network has to compute the feed
forward phase only.

✓ Even if the training is slow, the trained net
can produce its output immediately.


ARCHITECTURE


✓ The output units and the hidden units can have
biases.

✓ These bias terms are like weights on connections from
units whose output is always 1.

✓ During feed forward, the signals flow in the forward
direction, i.e. from the input units to the hidden units and finally
to the output units.

✓ During the back propagation phase of learning, the
signals flow in the reverse direction.


ALGORITHM 


✓ The training involves three stages
✓ 1. Feed forward of the input training pattern
✓ 2. Back propagation of the associated error
✓ 3. Adjustment of the weights.

✓ During feed forward, each input unit (Xi) receives an input
signal and sends this signal to each of the hidden units Z1, Z2, ...,
Zp.

✓ Each hidden unit computes its activation and sends its signal to
each output unit.

✓ Each output unit computes its activation to produce the
output or the response of the neural net for the given input
pattern.

✓ During training, each output unit compares its computed
activation yk with its target value tk to determine the
associated error for the particular pattern.

✓ Based on this error, the factor δk is computed for all m output units.

✓ This computed δk is used to propagate the error at the output
unit Yk back to all units in the hidden layer.

✓ At a later stage it is also used to update the weights between
the output and the hidden layer.

✓ In the same way, δj is computed for each of the p
hidden units Zj.

✓ The values of δj are not sent back to the input units but are
used to update the weights between the hidden layer and the
input layer.

✓ Once all the δ factors are known, the weights for all
layers are changed simultaneously.

✓ The adjustment to all weights wjk is based on the
factor δk and the activation zj of the hidden unit Zj.

✓ The change in weight on the connection between the
input layer and the hidden layer is based on δj and the
activation xi of the input unit.


[Flowchart: back propagation training. Initialize the weights to some random values; for each training pair, receive the input signal x_i and transmit it to the hidden units; in each hidden unit calculate the output z_j and send it to the output layer units; calculate the output signal of the output layer, y_ink = w_0k + Σ_j z_j w_jk, y_k = f(y_ink); the target t_k enters and the error correction factors between the output and hidden layers are computed; update the weights and biases on the output units, w_jk(new) = w_jk(old) + Δw_jk, and on the hidden units, v_ij(new) = v_ij(old) + Δv_ij; stop if the specified number of epochs is reached or t_k = y_k.]


ACTIVATION FUNCTION


✓ An activation function for a back propagation net should
have several important characteristics.

✓ It should be continuous, differentiable and monotonically
non-decreasing.

✓ For computational efficiency, it is better if the derivative is
easy to calculate.

✓ For the commonly used activation functions, the derivative
can be expressed in terms of the value of the function itself.
The function is expected to saturate asymptotically.

✓ The commonly used activation function is the binary
sigmoidal function.


TRAINING ALGORITHM


✓ The activation function used for a back propagation neural
network can be either a bipolar sigmoid or a binary sigmoid.

✓ The form of the data plays an important role in choosing the type
of activation function.

✓ Because of the relationship between the value of the function
and its derivative, additional evaluations of exponential
functions are not required.


> Step 0: Initialize the weights

> Step 1: While the stopping condition is false, do Steps 2
to 9

> Step 2: For each training pair, do Steps 3-8


Feed forward

Step 3: Each input unit receives an input signal x_i and propagates it to
all units in the hidden layer

Step 4: Each hidden unit sums its weighted input signals and
applies its activation function to compute its output signal
   z_inj = v_0j + Σ_i x_i v_ij,   z_j = f(z_inj)

Step 5: Each output unit sums its weighted input signals and
applies its activation function to compute its output signal
   y_ink = w_0k + Σ_j z_j w_jk,   y_k = f(y_ink)


Back propagation

Step 6: Each output unit receives a target pattern corresponding to the
input training pattern and computes its error information term
   δ_k = (t_k - y_k) f'(y_ink)
Calculates its weight correction term
   Δw_jk = α δ_k z_j
Calculates its bias correction term
   Δw_0k = α δ_k
And sends δ_k to the units in the layer below

Step 7: Each hidden unit sums its delta inputs
   δ_inj = Σ_k δ_k w_jk
Multiplies by the derivative of its activation function to calculate its
error information term
   δ_j = δ_inj f'(z_inj)
Calculates its weight correction term
   Δv_ij = α δ_j x_i
And calculates its bias correction term
   Δv_0j = α δ_j


Update weights and biases

Step 8: Each output unit updates its bias and weights
   w_jk(new) = w_jk(old) + Δw_jk
Each hidden unit updates its bias and weights
   v_ij(new) = v_ij(old) + Δv_ij

Step 9: Test the stopping condition. The stopping condition may be that a
certain number of epochs has been reached or that the actual output
equals the target value
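
The training steps can be sketched in plain Python for a network with one hidden layer. This is a minimal sketch under stated assumptions: the binary sigmoid is used as f, the function names are hypothetical, and the layer sizes, seed, α = 0.5, the XOR data and the fixed epoch budget are example values only.

```python
import math
import random


def f(x):
    """Binary sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def f_prime(x):
    """Derivative of f, expressed through the value of f."""
    s = f(x)
    return s * (1.0 - s)


def feed_forward(x, v, v0, w, w0):
    """Steps 3-5. v[i][j]: input i -> hidden j; w[j][k]: hidden j -> output k."""
    z_in = [v0[j] + sum(x[i] * v[i][j] for i in range(len(x)))
            for j in range(len(v0))]
    z = [f(s) for s in z_in]
    y_in = [w0[k] + sum(z[j] * w[j][k] for j in range(len(z)))
            for k in range(len(w0))]
    y = [f(s) for s in y_in]
    return z_in, z, y_in, y


def train_pattern(x, t, v, v0, w, w0, alpha):
    """Steps 3-8 for one training pair; weights are updated in place."""
    z_in, z, y_in, y = feed_forward(x, v, v0, w, w0)
    # Step 6: error information term of each output unit
    delta_k = [(t[k] - y[k]) * f_prime(y_in[k]) for k in range(len(y))]
    # Step 7: error information term of each hidden unit (uses the old w)
    delta_j = [sum(delta_k[k] * w[j][k] for k in range(len(y))) * f_prime(z_in[j])
               for j in range(len(z))]
    # Step 8: update output-layer, then hidden-layer weights and biases
    for k in range(len(y)):
        w0[k] += alpha * delta_k[k]
        for j in range(len(z)):
            w[j][k] += alpha * delta_k[k] * z[j]
    for j in range(len(z)):
        v0[j] += alpha * delta_j[j]
        for i in range(len(x)):
            v[i][j] += alpha * delta_j[j] * x[i]


# Illustrative run (assumed sizes: n inputs, p hidden units, m outputs).
random.seed(0)
n, p, m = 2, 3, 1
v = [[random.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(n)]
v0 = [random.uniform(-0.5, 0.5) for _ in range(p)]
w = [[random.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(p)]
w0 = [random.uniform(-0.5, 0.5) for _ in range(m)]
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
       ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
for epoch in range(2000):            # Step 1 and 9: fixed epoch budget
    for x, t in xor:                 # Step 2
        train_pattern(x, t, v, v0, w, w0, alpha=0.5)
outputs = [feed_forward(x, v, v0, w, w0)[3][0] for x, _ in xor]
print(outputs)
```

Whether the outputs end up close to the targets depends on the starting weights and the learning rate, so the sketch makes no claim about the final values.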


✓ The above algorithm uses the incremental approach for updating the
weights

✓ That is, the weights are changed immediately after a training
pattern is presented

✓ There is another way of training called batch mode training,
where the weights are changed only after all the training
patterns have been presented

✓ Batch mode requires additional local storage for each
connection to maintain the immediate weight changes

✓ When a BPN is used as a classifier, it is equivalent to the optimal
Bayesian discriminant function for asymptotically large sets of
training patterns

✓ Even if the BPN algorithm converges, it may get stuck
in a local minimum and may be unable to find a satisfactory
solution

✓ The randomness of the algorithm helps it to get out of local
minima

✓ The error function may have a large number of global minima
because of permutations of weights that keep the network input-output
function unchanged


[Figure: error surface with the global maximum, a local maximum, a local minimum and the global minimum marked]


LEARNING FACTORS OF BACK PROPAGATION
NETWORK


> Training of a BPN is based on the choice
of various parameters

> Convergence of a BPN is based on some
important learning factors such as
> Initial weights
> The learning rate
> The update rule
> Size and nature of the training set
> Architecture (number of layers and number
of neurons in each layer)


INITIAL WEIGHT 


> The weights are initialized with some random values

> The choice of initial weights determines how fast
the network converges

> The initial weights cannot be very high because the sigmoid
activation function used here may saturate from
the beginning itself and the system may get stuck at
a local minimum

> One method is to choose the weights in the range
[-3/√o_i, 3/√o_i]
> where o_i is the number of processing elements j that
feed forward to processing element i

> The initialization can also be done by a method called Nguyen-Widrow
initialization

> Leads to faster convergence of the network
> Improves the learning ability of the hidden layer

> The randomly initialized weights connecting the input neurons to the
hidden neurons are rescaled by

v_ij(new) = γ v_ij(old) / ||v_j(old)||
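
Both initialization schemes can be sketched as follows. The slide does not define γ; the sketch assumes the scale factor usually quoted for Nguyen-Widrow initialization, γ = 0.7 p^(1/n) for n input units and p hidden units, starting weights drawn from [-0.5, 0.5], and biases drawn from [-γ, γ]. The function names are hypothetical.

```python
import math
import random


def fan_in_init(fan_in, rng):
    """One weight drawn uniformly from [-3/sqrt(o_i), 3/sqrt(o_i)]."""
    bound = 3.0 / math.sqrt(fan_in)
    return rng.uniform(-bound, bound)


def nguyen_widrow(n_inputs, n_hidden, rng=None):
    """Return (v, v0, gamma) with v[i][j] linking input i to hidden unit j."""
    if rng is None:
        rng = random.Random(0)           # fixed seed for repeatable runs
    gamma = 0.7 * n_hidden ** (1.0 / n_inputs)   # assumed scale factor
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
         for _ in range(n_inputs)]
    for j in range(n_hidden):
        # norm of the weight vector entering hidden unit j
        norm = math.sqrt(sum(v[i][j] ** 2 for i in range(n_inputs)))
        for i in range(n_inputs):
            v[i][j] = gamma * v[i][j] / norm
    v0 = [rng.uniform(-gamma, gamma) for _ in range(n_hidden)]
    return v, v0, gamma


v_init, v0_init, gamma = nguyen_widrow(n_inputs=2, n_hidden=3)
print(gamma, v_init, v0_init)
```

After rescaling, every hidden unit's incoming weight vector has length γ, while its direction stays random.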


LEARNING RATE α


• Affects the convergence of the BPN

• A large value of α may speed up the convergence but may result
in overshooting

• Values of α in the range 10^-3 to 10 have been used successfully in
several back propagation experiments

• A smaller learning rate leads to slower learning


Momentum factor


• Gradient descent is very slow if the learning rate is small and
oscillates widely if α is too large

• One method that allows a larger learning rate without oscillation is
adding a momentum factor to the normal gradient descent method


GENERALIZATION 


> A network is said to generalize when it sensibly
interpolates for input patterns that are new to the network

> When there are many trainable parameters for a given amount
of training data, the network learns well but does not generalize
well

> This is usually called over fitting or over training

> One solution is to monitor the error on the test set and terminate
the training when the error increases

> With a small number of trainable parameters, the network fails
to learn the training data and performs poorly on the test data

> However, a network with a large number of nodes is capable of
memorizing the training set at the cost of generalization

> As a result, smaller nets are preferred to larger ones
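
The idea of terminating training when the monitored error starts to increase can be sketched as below. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions added for illustration; `patience=1` corresponds to stopping at the first increase.

```python
def early_stopping_epoch(monitored_errors, patience=1):
    """Return the index of the epoch whose weights should be kept."""
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(monitored_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1                  # error did not improve this epoch
            if waited >= patience:
                break                    # terminate the training
    return best_epoch


# Hypothetical per-epoch errors on the held-out set.
errors = [0.9, 0.6, 0.4, 0.45, 0.3]
print(early_stopping_epoch(errors))
print(early_stopping_epoch(errors, patience=2))
```

A larger patience tolerates short-lived rises in the error at the cost of extra training epochs.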


Number of Training Data 


• Training data should be sufficient and proper


• There exists a rule of thumb which states that the training data should cover
the entire expected input space and that, while training, the training vector pairs
should be selected randomly from the set


Number of Hidden Layer nodes 


> If there is more than one hidden layer in a BPN, then
the calculations performed for a single layer are repeated
for all the layers and summed up at the end

> In all multilayer feed forward networks, the size of the
hidden layer is very important

> The number of hidden layers needed for an application is
determined separately

> The size of a hidden layer is determined experimentally

> For a network of reasonable size, the size of the hidden layer
has to be only a relatively small fraction of the input layer

> For example, if the network does not converge to a
solution, it may need more hidden layers


Testing Algorithm 


• Step 0: Initialize the weights. The weights are taken from the training algorithm
• Step 1: Perform Steps 2-4 for each input vector

• Step 2: Set the activation of each input unit x_i, i = 1 to n

• Step 3: Calculate the net input to each hidden unit and its output
   z_inj = v_0j + Σ_i x_i v_ij,   z_j = f(z_inj)

• Step 4: Now compute the output of each output layer unit, using the sigmoidal
activation function
   y_ink = w_0k + Σ_j z_j w_jk,   y_k = f(y_ink)
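
The testing phase only needs the feed forward pass, which can be sketched as below. The function name and the all-zero weights in the usage lines are placeholders for illustration; in practice the weights come from the training algorithm.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bpn_test(x, v, v0, w, w0):
    """Steps 2-4: v[i][j] links input i to hidden j, w[j][k] hidden j to output k."""
    z = [sigmoid(v0[j] + sum(x[i] * v[i][j] for i in range(len(x))))
         for j in range(len(v0))]
    y = [sigmoid(w0[k] + sum(z[j] * w[j][k] for j in range(len(z))))
         for k in range(len(w0))]
    return y


# Placeholder weights for a 2-2-1 network (not trained values).
v = [[0.0, 0.0], [0.0, 0.0]]
v0 = [0.0, 0.0]
w = [[0.0], [0.0]]
w0 = [0.0]
y_test = bpn_test([1.0, 0.0], v, v0, w, w0)
print(y_test)
```

No error terms or weight corrections are computed here, which is why a trained net can respond quickly.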


APPLICATIONS OF BACKPROPAGATION 
NETWORK 


> Load forecasting problems in power systems.
> Image processing.
> Fault diagnosis and fault detection.
> Gesture recognition, speech recognition.
> Signature verification.
> Bioinformatics.
> Structural engineering design (civil).


